ETSI TS 126 094 V3.0.0



(2000-01) 



Technical Specification 



Universal Mobile Telecommunications System (UMTS); 

Mandatory Speech Codec speech processing functions 

AMR speech codec; Voice Activity Detector (VAD) 

(3G TS 26.094 version 3.0.0 Release 1999) 










Reference 



DTS/TSGS-0426094U 
Keywords 



UMTS 




Postal address 



F-06921 Sophia Antipolis Cedex - FRANCE 

Office address 

650 Route des Lucioles - Sophia Antipolis 

Valbonne - FRANCE 

Tel.: +33 4 92 94 42 00 Fax: +33 4 93 65 47 16 

Siret N°348 623 562 00017 - NAF 742 C 

Association à but non lucratif enregistrée à la

Sous-Préfecture de Grasse (06) N° 7803/88



Internet 



secretariat@etsi.fr 

Individual copies of this ETSI deliverable 

can be downloaded from 

http://www.etsi.org 

If you find errors in the present document, send your 

comment to: editor@etsi.fr 



Important notice 



This ETSI deliverable may be made available in more than one electronic version or in print. In any case of existing or 

perceived difference in contents between such versions, the reference version is the Portable Document Format (PDF). 

In case of dispute, the reference shall be the printing on ETSI printers of the PDF version kept on a specific network 

drive within ETSI Secretariat. 



Copyright Notification 

No part may be reproduced except as authorized by written permission. 
The copyright and the foregoing restriction extend to reproduction in all media. 



© European Telecommunications Standards Institute 2000.
All rights reserved. 






Intellectual Property Rights 



IPRs essential or potentially essential to the present document may have been declared to ETSI. The information 
pertaining to these essential IPRs, if any, is publicly available for ETSI members and non-members, and can be found 
in SR 000 314; "Intellectual Property Rights (IPRs); Essential, or potentially Essential, IPRs notified to ETSI in respect 
of ETSI standards", which is available from the ETSI Secretariat. Latest updates are available on the ETSI Web server 
(http://www.etsi.org/ipr). 

Pursuant to the ETSI IPR Policy, no investigation, including IPR searches, has been carried out by ETSI. No guarantee 
can be given as to the existence of other IPRs not referenced in SR 000 314 (or the updates on the ETSI Web server) 
which are, or may be, or may become, essential to the present document. 



Foreword 



This Technical Specification (TS) has been produced by the ETSI 3rd Generation Partnership Project (3GPP).

The present document may refer to technical specifications or reports using their 3GPP identities or GSM identities. 
These should be interpreted as being references to the corresponding ETSI deliverables. The mapping of document 
identities is as follows: 

For 3GPP documents: 

3G TS | TR nn.nnn "<title>" (with or without the prefix 3G)

is equivalent to

ETSI TS | TR 1nn nnn "[Digital cellular telecommunications system (Phase 2+) (GSM);] Universal Mobile
Telecommunications System; <title>"

For GSM document identities of type "GSM xx.yy", e.g. GSM 01.04, the corresponding ETSI document identity may be
found in the Cross Reference List on www.etsi.org/key






Contents 



Foreword
1 Scope
2 Normative References
3 Technical Description of VAD Option 1
3.1 Definitions, symbols and abbreviations
3.1.1 Definitions
3.1.2 Symbols
3.1.2.1 Variables
3.1.2.2 Constants
3.1.2.3 Functions
3.1.3 Abbreviations
3.2 General
3.3 Functional description
3.3.1 Filter bank and computation of sub-band levels
3.3.2 Pitch detection
3.3.3 Tone detection
3.3.4 Correlated Complex Signal Analysis (and detection)
3.3.5 VAD decision
3.3.5.1 Hangover addition
3.3.5.2 Background noise estimation
4 Technical Description of VAD Option 2
4.1 Definitions, symbols and abbreviations
4.1.1 Definitions
4.1.2 Symbols
4.1.2.1 Variables
4.1.2.2 Constants
4.1.2.3 Functions
4.1.3 Abbreviations
4.2 General
4.3 Functional description
4.3.1 Frequency Domain Conversion
4.3.2 Channel Energy Estimator
4.3.3 Channel SNR Estimator
4.3.4 Voice Metric Calculation
4.3.5 Frame SNR and Long-Term Peak SNR Calculation
4.3.6 Negative SNR Sensitivity Bias
4.3.7 VAD Decision
4.3.8 Spectral Deviation Estimator
4.3.9 Sinewave Detection
4.3.10 Background Noise Update Decision
4.3.10 Background Noise Estimate Update
5 Computational details
Annex A (informative): Change history
History






Foreword 



This Technical Specification has been produced by the 3rd Generation Partnership Project, Technical Specification
Group Services and System Aspects, Working Group 4 (Codec).

The contents of this informal TS may be subject to continuing work within the 3GPP and may change following formal
TSG-S4 approval. Should TSG-S4 modify the contents of this TS, it will be re-released with an identifying change of
release date and an increase in version number as follows:

Version m.x.y

where:

m   indicates [major version number]

x   the second digit is incremented for all changes of substance, i.e. technical enhancements, corrections,
    updates, etc.

y   the third digit is incremented when editorial only changes have been incorporated into the specification.

1 Scope

This document specifies two alternatives for the Voice Activity Detector (VAD) to be used in the
Discontinuous Transmission (DTX) as described in [3]. Implementors of mobile station and infrastructure
equipment conforming to the AMR specifications can choose which of the two VAD options to implement.
There are no interoperability factors associated with this choice.

The requirements are mandatory on any VAD to be used either in User Equipment (UE) or Base Station
Systems (BSSs) that utilize the AMR speech codec.

2 Normative References

This TS incorporates by dated and undated reference, provisions from other publications. These normative
references are cited in the appropriate places in the text and the publications are listed hereafter. For dated
references, subsequent amendments to or revisions of any of these publications apply to this TS only when
incorporated in it by amendment or revision. For undated references, the latest edition of the publication
referred to applies.

[1] TS 26.73: "ANSI-C code for the Adaptive Multi Rate speech codec".

[2] TS 26.90: "AMR Speech Codec Speech Transcoding Functions".

[3] TS 26.93: "AMR Speech codec; Source Controlled Rate Operation".

[4] ITU, The International Telecommunications Union, Blue Book, Vol. III, Telephone
    Transmission Quality, IXth Plenary Assembly, Melbourne, 14-25 November, 1988,
    Recommendation G.711, Pulse code modulation (PCM) of voice frequencies.



3 Technical Description of VAD Option 1 

3.1 Definitions, symbols and abbreviations 

3.1.1 Definitions 

For the purposes of this TS, the following definitions apply: 

frame: Time interval of 20 ms corresponding to the time segmentation of the speech 
transcoder. 

3.1.2 Symbols 

For the purposes of this TS, the following symbols apply. 

3.1.2.1 Variables 

bckr_est[n]          background noise estimate
burst_count          counts length of a speech burst, used by VAD hangover addition
hang_count           hangover counter, used by VAD hangover addition
complex_hang_count   hangover counter, used by CAD hangover addition
complex_hang_timer   hangover initiator, used for Complex Activity Estimation
lagcount             pitch detection counter
level[n]             signal level
new_speech           pointer of the speech encoder, points to a buffer containing the last received samples of a
                     speech frame [2]
noise_level          average level of the background noise estimate
oldlagcount          lagcount of the previous frame
pitch                flag indicating presence of a periodic signal
complex_warning      flag indicating the presence of a complex signal
best_corr_hp         normalized and limited value from maximum HP filtered correlation vector
corr_hp              filtered best_corr_hp values
pow_sum              power of the input frame
s(i)                 samples of the input frame
snr_sum              measure between input frame and noise estimate
stat_count           stationarity counter
stat_rat             measure indicating stationarity
T_op[n]              open-loop lags [2]
t0                   autocorrelation maxima calculated by the open-loop pitch analysis [2]
t1                   signal power related to the autocorrelation maxima t0 [2]
tone                 flag indicating the presence of a tone
vad_thr              VAD threshold
VAD_flag             Boolean VAD flag
vadreg               intermediate VAD decision
complex_low          intermediate complex signal decisions
complex_high         intermediate complex signal decisions



3.1.2.2 Constants 

ALPHA_UP1                constant for updating noise estimate (see subclause 3.3.5.2)
ALPHA_DOWN1              constant for updating noise estimate (see subclause 3.3.5.2)
ALPHA_UP2                constant for updating noise estimate (see subclause 3.3.5.2)
ALPHA_DOWN2              constant for updating noise estimate (see subclause 3.3.5.2)
ALPHA3                   constant for updating noise estimate (see subclause 3.3.5.2)
ALPHA4                   constant for updating average signal level (see subclause 3.3.5.2)
ALPHA5                   constant for updating average signal level (see subclause 3.3.5.2)
BURST_LEN_HIGH_NOISE     constant for controlling VAD hangover addition (see subclause 3.3.5.1)
BURST_LEN_LOW_NOISE      constant for controlling VAD hangover addition (see subclause 3.3.5.1)
COEFF3                   coefficient for the filter bank (see subclause 3.3.1)
COEFF5_1                 coefficient for the filter bank (see subclause 3.3.1)
COEFF5_2                 coefficient for the filter bank (see subclause 3.3.1)
HANG_LEN_HIGH_NOISE      constant for controlling VAD hangover addition (see subclause 3.3.5.1)
HANG_LEN_LOW_NOISE       constant for controlling VAD hangover addition (see subclause 3.3.5.1)
HANG_NOISE_THR           constant for controlling VAD hangover addition (see subclause 3.3.5.1)
L_FRAME                  size of a speech frame, 160
L_NEXT                   length for the lookahead of the speech encoder, 40
LTHRESH                  threshold for pitch detection (see subclause 3.3.2)
NOISE_MAX                maximum value for noise estimate (see subclause 3.3.5.2)
NOISE_MIN                minimum value for noise estimate (see subclause 3.3.5.2)
NTHRESH                  threshold for pitch detection (see subclause 3.3.2)
POW_PITCH_THR            threshold for pitch detection (see subclause 3.3.5)
POW_COMPLEX_THR          threshold for complex detection (see subclause 3.3.5)
STAT_COUNT               threshold for stationary detection (see subclause 3.3.5.2)
CAD_MIN_STAT_COUNT       minimum threshold after complex warning
STAT_THR                 threshold for stationary detection (see subclause 3.3.5.2)
STAT_THR_LEVEL           threshold for stationary detection (see subclause 3.3.5.2)
TONE_THR                 threshold for tone detection (see subclause 3.3.3)
VAD_P1                   constant of computation for VAD threshold (see subclause 3.3.5)
VAD_POW_LOW              constant for controlling VAD hangover addition (see subclause 3.3.5.1)
VAD_SLOPE                constant of computation for VAD threshold (see subclause 3.3.5)
VAD_THR_HIGH             constant of computation for VAD threshold (see subclause 3.3.5)
CVAD_THRESH_ADAPT_HIGH   constant for updating complex_high
CVAD_THRESH_ADAPT_LOW    constant for updating complex_low
CVAD_THRESH_HANG         constant for updating complex_hang_timer
CVAD_HANG_LIMIT          constant for initiating complex_hang_count
CVAD_HANG_LENGTH         constant for resetting complex_hang_count



3.1.2.3 Functions

+          addition
-          subtraction
*          multiplication
/          division
|x|        absolute value of x
AND        Boolean AND
OR         Boolean OR
MIN(x,y)   the smaller of x and y
MAX(x,y)   the larger of x and y

$$\sum_{n=a}^{b} x(n) = x(a) + x(a+1) + \ldots + x(b-1) + x(b)$$




3.1.3 Abbreviations

ANSI   American National Standards Institute
DTX    Discontinuous Transmission
VAD    Voice Activity Detector
CAD    Complex Activity Detection
CNG    Comfort Noise Generation

3.2 General



The function of the VAD algorithm is to indicate whether each 20 ms frame contains signals that should be 
transmitted, i.e. speech, music or information tones. The output of the VAD algorithm is a Boolean flag 
(VAD_flag) indicating presence of such signals. 



3.3 Functional description 



The block diagram of the VAD algorithm is depicted in figure 3.1. The VAD algorithm uses parameters of the
speech encoder to compute the Boolean VAD flag (VAD_flag). Samples of the input frame (s(i)) are divided
into sub-bands and the level of the signal in each band (level[n]) is calculated. The inputs for the pitch detection
function are the open-loop lags (T_op[n]), which are calculated by the open-loop pitch analysis of the speech
encoder. The pitch detection function computes a flag (pitch) which indicates the presence of pitch. The tone
detection function calculates a flag (tone), which indicates the presence of an information tone. Tones are
detected based on the pitch gain of the open-loop pitch analysis. The pitch gain is estimated using
autocorrelation values (t0 and t1) received from the pitch analysis. The complex signal detection function
calculates a flag (complex_warning), which indicates the presence of a correlated complex signal such as music.
Correlated complex signals are detected based on analysis of the correlation vector available in the
open-loop pitch analysis. The VAD decision function estimates background noise levels. The intermediate VAD decision
is calculated based on the comparison of the background noise estimate and the levels of the input frame
(level[n]). Finally, the VAD flag is calculated by adding hangover to the intermediate VAD decision.

[Figure 3.1. Simplified block diagram of the VAD algorithm: Option 1. Recoverable labels: input s(i), filter bank, pitch analysis, output VAD flag.]

3.3.1 Filter bank and computation of sub-band levels 

The input signal is divided into frequency bands using a 9-band filter bank (figure 3.2). Cut-off frequencies for
the filter bank are shown in table 3.1.

Table 3.1. Cut-off frequencies for the filter bank

Band number   Frequencies
1             0 - 250 Hz
2             250 - 500 Hz
3             500 - 750 Hz
4             750 - 1000 Hz
5             1000 - 1500 Hz
6             1500 - 2000 Hz
7             2000 - 2500 Hz
8             2500 - 3000 Hz
9             3000 - 4000 Hz



Input for the filter bank is the speech frame pointed to by the new_speech pointer of the speech encoder [1].
Input values for the filter bank are scaled down by one bit. This ensures safe scaling, i.e. saturation cannot
occur during calculation of the filter bank.



[Figure 3.2. Filter bank. A cascade of 5th order and 3rd order filter blocks with outputs 3 - 4 kHz, 2.5 - 3 kHz,
2 - 2.5 kHz, 1.5 - 2 kHz, 1 - 1.5 kHz, 750 - 1000 Hz, 500 - 750 Hz, 250 - 500 Hz and 0 - 250 Hz.]

The filter bank consists of 5th and 3rd order filter blocks. Each filter block divides the input into high-pass and
low-pass parts and decimates the sampling frequency by 2. The 5th order filter block is calculated as follows:

$$x_{lp}(i) = 0.5 \cdot \bigl(A_1(x(i-1)) + A_2(x(i))\bigr) \qquad (3.1a)$$
$$x_{hp}(i) = 0.5 \cdot \bigl(A_1(x(i-1)) - A_2(x(i))\bigr) \qquad (3.1b)$$

where:

x(i)      input signal for a filter block
x_lp(i)   low-pass component
x_hp(i)   high-pass component

The 3rd order filter block is calculated as follows:

$$x_{lp}(i) = 0.5 \cdot \bigl(x(i) + A_3(x(i-1))\bigr) \qquad (3.2a)$$
$$x_{hp}(i) = 0.5 \cdot \bigl(x(i) - A_3(x(i-1))\bigr) \qquad (3.2b)$$

The filters A_1(), A_2(), and A_3() are first order direct form all-pass filters, whose transfer function is given by:

$$A(z) = \frac{C + z^{-1}}{1 + C \cdot z^{-1}} \qquad (3.3)$$

where C is the filter coefficient.

Coefficients for the all-pass filters A_1(), A_2(), and A_3() are COEFF5_1, COEFF5_2, and COEFF3,
respectively.
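The all-pass transfer function (3.3) corresponds to the difference equation y(n) = C*x(n) + x(n-1) - C*y(n-1). The following is a minimal floating-point sketch of that single building block only. The function name, the zero initial state and the coefficient value 0.5 are assumptions made for illustration; the specification's fixed-point arithmetic and its actual COEFF values are not reproduced here.

```python
def allpass_first_order(x, c):
    """Filter x through A(z) = (c + z^-1) / (1 + c*z^-1).

    Difference equation: y(n) = c*x(n) + x(n-1) - c*y(n-1),
    starting from zero state (an assumption of this sketch).
    """
    y = []
    x_prev = 0.0
    y_prev = 0.0
    for sample in x:
        out = c * sample + x_prev - c * y_prev
        y.append(out)
        x_prev, y_prev = sample, out
    return y


# Impulse response for a placeholder coefficient c = 0.5
# (not one of the COEFF constants of the specification).
impulse_response = allpass_first_order([1.0, 0.0, 0.0, 0.0], 0.5)
print(impulse_response)
```

A filter block would combine two such filtered branches as in (3.1a)/(3.1b) or one filtered and one unfiltered branch as in (3.2a)/(3.2b).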



Signal level is calculated at the output of the filter bank at each frequency band as follows:

$$level(n) = \sum_{i=START_n}^{END_n} |x_n(i)| \qquad (3.4)$$

where:

n        index for the frequency band
x_n(i)   sample i at the output of the filter bank at frequency band n

$$START_n = \begin{cases} -2, & n \le 4 \\ -4, & 5 \le n \le 8 \\ -8, & n = 9 \end{cases} \qquad END_n = \begin{cases} 9, & n \le 4 \\ 19, & 5 \le n \le 8 \\ 39, & n = 9 \end{cases}$$

Negative indices of x_n(i) refer to the previous frame.
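The level computation of (3.4) can be sketched as follows. This is an illustrative floating-point version; the helper names are hypothetical, and the band limits are read as n <= 4, 5 <= n <= 8 and n = 9 so that all nine bands are covered.

```python
def band_range(n):
    """Return (START_n, END_n) for band n = 1..9, per the limits given with (3.4)."""
    if not 1 <= n <= 9:
        raise ValueError("band index must be in 1..9")
    if n <= 4:
        return -2, 9
    if n <= 8:
        return -4, 19
    return -8, 39


def sub_band_level(n, current, previous):
    """Sum |x_n(i)| for i = START_n..END_n.

    current  : band-n filter bank output for the current frame (index 0 upward)
    previous : band-n output for the previous frame; its last samples
               supply the negative indices.
    """
    start, end = band_range(n)
    tail = previous[start:]          # samples x_n(START_n) .. x_n(-1)
    head = current[:end + 1]         # samples x_n(0) .. x_n(END_n)
    return sum(abs(v) for v in tail) + sum(abs(v) for v in head)
```

For example, band 1 uses the last two samples of the previous frame plus ten samples of the current frame.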

3.3.2 Pitch detection 

The purpose of the pitch detection function is to detect vowel sounds and other periodic signals. The pitch
detection is based on comparison of open-loop lags (T_op[n]), which are calculated by the speech encoder
[2]. If the difference of consecutive open-loop lags (T_op[n]) is smaller than a threshold, lagcount is
incremented. If the sum of the lagcounts of two consecutive frames is high enough, the pitch flag is set. For
5.15 and 4.75 kbit/s rates, only one open-loop lag is calculated, and therefore only the first lag comparison is
made every frame. The pitch flag is calculated as follows:

lagcount = 0
if ( | T_op[-1] - T_op[0] | < LTHRESH )
    lagcount = lagcount + 1
if ( | T_op[0] - T_op[1] | < LTHRESH )
    lagcount = lagcount + 1
if ( lagcount + oldlagcount > NTHRESH )
    pitch = 1
else
    pitch = 0
oldlagcount = lagcount

T_op[-1] refers to the open-loop lag of the previous frame.
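The pitch flag logic above can be sketched in a few lines. The function name and argument layout are illustrative, and the threshold values used in any call are placeholders rather than the LTHRESH and NTHRESH values of the specification.

```python
def pitch_flag(t_op_prev, t_op, old_lagcount, lthresh, nthresh):
    """Return (pitch, lagcount) for one frame.

    t_op_prev    : open-loop lag of the previous frame, T_op[-1]
    t_op         : lags of the current frame, one or two values
    old_lagcount : lagcount of the previous frame
    """
    lags = [t_op_prev] + list(t_op)
    # Count consecutive lag pairs whose difference is below the threshold.
    lagcount = sum(1 for a, b in zip(lags, lags[1:]) if abs(a - b) < lthresh)
    pitch = 1 if lagcount + old_lagcount > nthresh else 0
    return pitch, lagcount
```

With a single lag per frame (as for the 5.15 and 4.75 kbit/s rates) only the first comparison takes place, because `t_op` then holds one value. The returned lagcount is what the caller would store as oldlagcount for the next frame.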



3.3.3 Tone detection 

Tone detection is used to detect information tones, since the pitch detection function cannot always detect
these signals. Also, other signals which contain a very strong periodic component are detected, because it
may sound annoying if these signals are replaced by comfort noise. If the open-loop pitch gain is higher than
the constant TONE_THR, a tone is detected and the tone flag is set. The pitch gain can be tested by comparing
variables t0 and t1 as follows:

if (t0 > TONE_THR * t1)
    tone = 1

The speech encoder calculates the pitch in three delay ranges, except for mode 10.2 kbit/s, where only one
range is used. The above comparison is made once for each delay range and the tone flag should be set if
the condition is true in at least one range. Otherwise, the tone flag should be set to zero.

The variables t0 and t1 are calculated by the open-loop pitch analysis of the speech encoder [2]. The
variable t0 is the autocorrelation maximum given by:

$$t0 = \sum_{n} s_w(n) \, s_w(n-k) \qquad (3.5)$$

The variable t1 is the signal power related to the autocorrelation maximum t0 at the delay value k:

$$t1 = \sum_{n} s_w^2(n-k) \qquad (3.6)$$

The open-loop pitch search and correspondingly the tone flag is computed twice in each frame, except for 
modes 5.15 kbit/s and 4.75 kbit/s, where it is computed only once. 
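The comparison t0 > TONE_THR * t1 avoids an explicit division when testing the pitch gain. A minimal sketch follows, assuming the sums in (3.5) and (3.6) run over all samples for which both s(n) and s(n-k) exist; the summation limits, function names and the threshold value in the usage lines are assumptions of this example, not taken from the specification.

```python
def autocorr_pair(s, k):
    """Return (t0, t1) for delay k over the samples available in s."""
    t0 = sum(s[n] * s[n - k] for n in range(k, len(s)))
    t1 = sum(s[n - k] ** 2 for n in range(k, len(s)))
    return t0, t1


def tone_flag(pairs, tone_thr):
    """Set the flag if t0 > tone_thr * t1 holds in at least one delay range.

    pairs : iterable of (t0, t1), one pair per delay range
    """
    return 1 if any(t0 > tone_thr * t1 for t0, t1 in pairs) else 0


# A signal with period 4 correlates fully at delay 4 and negatively at delay 2.
signal = [1, 0, -1, 0] * 10
at_period = autocorr_pair(signal, 4)
off_period = autocorr_pair(signal, 2)
```

Passing several (t0, t1) pairs models the per-range comparison: the flag is set when any one range satisfies the condition.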

3.3.4 Correlated Complex Signal Analysis (and detection) 

Correlated complex signal detection is used to detect correlated signals in the high-pass filtered weighted
speech domain, since the pitch and tone detection functions cannot always detect these signals. Signals
which contain very strong correlation values in the high-pass filtered domain are taken care of, because it
may sound really annoying if these signals are replaced by comfort noise. If the statistics of the maximum
normalized correlation value of a high-pass filtered input signal indicate the presence of a correlated
complex signal, the flag complex_warning is set. To reduce complexity, the high band correlation analysis is
performed in a simplified manner by analysing the high-pass filtered fullband correlation vector which is
available from the OL-LTP analysis performed by the speech encoder at least once in each frame.

best_corr_hp_m is the maximum normalized value of the high-pass filtered correlation in the range 19-146,
limited to be in the range [0.0, 1.0]. (Note that the best_corr_hp value is delayed one frame.) The high-pass
filter is a simple first order filter with coefficients [1, -1]. The best_corr_hp value is filtered according to:

$$corr\_hp_{m+1} = alpha \cdot corr\_hp_m + (1 - alpha) \cdot best\_corr\_hp_m$$

where alpha is varied between 0.98 and 0.8 as a function of corr_hp_m and best_corr_hp_m.

The corr_hp output value is thresholded into two registers, complex_high and complex_low, and one counter,
complex_hang_timer.

complex_low is set to 1 if the corr_hp value is greater than CVAD_THRESH_ADAPT_LOW.

complex_high is set to 1 if the corr_hp value is greater than CVAD_THRESH_ADAPT_HIGH.

complex_hang_timer is increased by 1 if the corr_hp value is greater than CVAD_THRESH_HANG. If the
corr_hp value is lower than or equal to CVAD_THRESH_HANG, the complex_hang_timer value is set to 0.

The flag complex_warning is set if complex_low has been set for 15 consecutive frames or complex_high
has been set for 8 consecutive frames.

The open-loop pitch search and correspondingly the tone flag is computed twice in each frame, except for
modes 5.15 kbit/s and 4.75 kbit/s, where it is computed only once. The computation of the corr_hp value is
however always done only once per frame using the newest correlation vector available.
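The smoothing and thresholding described above can be sketched as a small state holder. This is an illustration under stated assumptions: the class name is hypothetical, the rule for choosing alpha is not given in the text and is therefore left to the caller, consecutive-frame runs are tracked with counters rather than bit registers, and the threshold values in the usage lines are placeholders rather than the CVAD constants.

```python
class ComplexDetector:
    """Tracks corr_hp and the runs that lead to complex_warning."""

    def __init__(self, thr_low, thr_high, thr_hang):
        self.thr_low = thr_low      # stands in for CVAD_THRESH_ADAPT_LOW
        self.thr_high = thr_high    # stands in for CVAD_THRESH_ADAPT_HIGH
        self.thr_hang = thr_hang    # stands in for CVAD_THRESH_HANG
        self.corr_hp = 0.0
        self.low_run = 0            # consecutive frames with complex_low set
        self.high_run = 0           # consecutive frames with complex_high set
        self.hang_timer = 0         # complex_hang_timer

    def update(self, best_corr_hp, alpha):
        """Process one frame and return the complex_warning flag."""
        self.corr_hp = alpha * self.corr_hp + (1.0 - alpha) * best_corr_hp
        self.low_run = self.low_run + 1 if self.corr_hp > self.thr_low else 0
        self.high_run = self.high_run + 1 if self.corr_hp > self.thr_high else 0
        self.hang_timer = self.hang_timer + 1 if self.corr_hp > self.thr_hang else 0
        return self.low_run >= 15 or self.high_run >= 8


detector = ComplexDetector(thr_low=0.4, thr_high=0.5, thr_hang=0.6)
warnings = [detector.update(best_corr_hp=1.0, alpha=0.8) for _ in range(11)]
```

In this run corr_hp first exceeds 0.5 in the fourth frame, so the eight-frame condition on complex_high is met in the eleventh frame.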

3.3.5 VAD decision 

Power of the input frame is calculated as follows:

$$pow\_sum = \sum_{i=-L\_NEXT}^{L\_FRAME - L\_NEXT - 1} s(i) \cdot s(i) \qquad (3.7)$$

where samples s(i) of the input frame are pointed to by the new_speech pointer of the speech encoder. If the
power of the input frame (pow_sum) is lower than the constant POW_PITCH_THR, the last pitch flag is set to
zero. If the power of the input frame (pow_sum) is lower than the constant POW_COMPLEX_THR, the last
complex_low flag is set to zero.

The difference between the signal levels of the input frame and background noise estimate is calculated as
follows:

$$snr\_sum = \sum_{n=1}^{9} MAX\left(1.0, \frac{level[n]}{bckr\_est[n]}\right)^2 \qquad (3.8)$$

where:

level[n]      signal level at band n
bckr_est[n]   level of background noise estimate at band n



The VAD decision is made by comparing the variable snr_sum to a threshold. The threshold (vad_thr) is tuned to
get the desired sensitivity at each background noise level. The higher the noise level, the lower the threshold.
In particular, a low threshold at high-level background noise is needed to detect speech reliably enough,
although the probability of detecting noise as speech also increases.

Average level of background noise is calculated by adding noise estimates at each band:

$$noise\_level = \sum_{n=1}^{9} bckr\_est[n] \qquad (3.9)$$

The threshold is calculated using the average noise level as follows:

$$vad\_thr = VAD\_SLOPE \cdot (noise\_level - VAD\_P1) + VAD\_THR\_HIGH \qquad (3.10)$$

where VAD_SLOPE, VAD_P1, and VAD_THR_HIGH are constants.

The variable vadreg indicates the intermediate VAD decision and it is calculated as follows:

if (snr_sum > vad_thr)
    vadreg = 1
else
    vadreg = 0
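The chain from sub-band levels to the intermediate decision can be sketched as one function. This floating-point example assumes the squared form of the per-band ratio in (3.8) and a negative slope, in line with the statement that the threshold falls as noise rises; the function name and all numeric values are placeholders, not the constants of the specification.

```python
def intermediate_vad(level, bckr_est, vad_slope, vad_p1, vad_thr_high):
    """Return (vadreg, snr_sum, vad_thr) for one frame.

    level, bckr_est : per-band signal levels and noise estimates (9 values each)
    """
    # (3.8): bands at or below the noise estimate contribute exactly 1.0.
    snr_sum = sum(max(1.0, lev / est) ** 2 for lev, est in zip(level, bckr_est))
    # (3.9): noise level summed over the bands.
    noise_level = sum(bckr_est)
    # (3.10): linear threshold as a function of the noise level.
    vad_thr = vad_slope * (noise_level - vad_p1) + vad_thr_high
    vadreg = 1 if snr_sum > vad_thr else 0
    return vadreg, snr_sum, vad_thr


noise = [5.0] * 9
loud = intermediate_vad([10.0] * 9, noise, vad_slope=-0.1, vad_p1=0.0, vad_thr_high=20.0)
quiet = intermediate_vad([5.0] * 9, noise, vad_slope=-0.1, vad_p1=0.0, vad_thr_high=20.0)
```

Because of the MAX(1.0, ...) term, snr_sum never drops below the number of bands, so the threshold only has to separate values from 9 upward.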

3.3.5.1 Hangover addition 

Before the final VAD flag is given, a hangover is added. The hangover addition helps to detect low power
endings of speech bursts, which are subjectively important but difficult to detect. Also, a long hangover is
added if the signal has been found to be of a very complex nature for a long time (2 seconds), since the VAD is
not likely to work reliably for such a complex signal.

The VAD flag is set to "1" if fewer than hang_len frames with "0" decision have elapsed since burst_len
consecutive "1" decisions have been detected. The variables hang_len and burst_len are set depending on
the average noise level (noise_level). The VAD_flag is also controlled by the complex_hang_count, which
indicates that the signal is too complex for the VAD and should not be used with a comfort noise generation
algorithm. The filtered correlation value corr_hp is also used as an activity indication after the VAD has
indicated noise for a while (during 200 ms). This will aid in situations where the VAD noise estimate has
adapted to a rather stationary signal that is still too complex to sound well with CNG.

The power of the input frame is compared to a threshold (VAD_POW_LOW). If the power is lower, the VAD
flag is set to "0" and no hangover is added. The VAD_flag is calculated as follows:

if (noise_level > HANG_NOISE_THR)
    burst_len = BURST_LEN_HIGH_NOISE
    hang_len = HANG_LEN_HIGH_NOISE
else
    burst_len = BURST_LEN_LOW_NOISE
    hang_len = HANG_LEN_LOW_NOISE

if (complex_hang_timer > CVAD_HANG_LIMIT) {
    if (complex_hang_count < CVAD_HANG_LENGTH) {
        complex_hang_count = CVAD_HANG_LENGTH;
    }
}

if (pow_sum < VAD_POW_LOW) {
    burst_count = 0;
    hang_count = 0;
    complex_hang_count = 0;
    complex_hang_timer = 0;
    VAD_flag = 0;
    goto Exit;
}

VAD_flag = 0;
if (complex_hang_count != 0) {
    burst_count = BURST_LEN_HIGH_NOISE;
    complex_hang_count = complex_hang_count - 1;
    VAD_flag = 1;
    goto Exit;
} else {
    if ((the 10 last out of 11 vadreg values all are zero) AND
        (corr_hp > CVAD_THRESH_IN_NOISE)) {
        VAD_flag = 1;
        goto Exit;
    }
}

if (vadreg = 1) {
    burst_count = burst_count + 1
    if (burst_count >= burst_len) {
        hang_count = hang_len
    }
    VAD_flag = 1
} else {
    burst_count = 0
    if (hang_count > 0) {
        hang_count = hang_count - 1
        VAD_flag = 1
    }
}
Label Exit
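The hangover logic can be sketched with early returns in place of the goto. All names and numeric values below are placeholders chosen so the example is easy to trace; they are not the constants of the specification, and the test on recent vadreg history is passed in as a precomputed Boolean rather than derived from a register.

```python
from dataclasses import dataclass


@dataclass
class Params:
    # Placeholder values for illustration only.
    hang_noise_thr: float = 100.0
    burst_len_high_noise: int = 3
    hang_len_high_noise: int = 4
    burst_len_low_noise: int = 2
    hang_len_low_noise: int = 2
    cvad_hang_limit: int = 100
    cvad_hang_length: int = 250
    cvad_thresh_in_noise: float = 0.65
    vad_pow_low: float = 10.0


@dataclass
class HangoverState:
    burst_count: int = 0
    hang_count: int = 0
    complex_hang_count: int = 0
    complex_hang_timer: int = 0


def hangover_addition(st, p, vadreg, recent_vadreg_all_zero,
                      noise_level, pow_sum, corr_hp):
    """Return VAD_flag for one frame and update the state in place."""
    high_noise = noise_level > p.hang_noise_thr
    burst_len = p.burst_len_high_noise if high_noise else p.burst_len_low_noise
    hang_len = p.hang_len_high_noise if high_noise else p.hang_len_low_noise

    if (st.complex_hang_timer > p.cvad_hang_limit
            and st.complex_hang_count < p.cvad_hang_length):
        st.complex_hang_count = p.cvad_hang_length

    if pow_sum < p.vad_pow_low:      # very low power: reset, no hangover
        st.burst_count = st.hang_count = 0
        st.complex_hang_count = st.complex_hang_timer = 0
        return 0

    if st.complex_hang_count != 0:   # complex signal hangover is running
        st.burst_count = p.burst_len_high_noise
        st.complex_hang_count -= 1
        return 1
    if recent_vadreg_all_zero and corr_hp > p.cvad_thresh_in_noise:
        return 1

    if vadreg == 1:
        st.burst_count += 1
        if st.burst_count >= burst_len:
            st.hang_count = hang_len
        return 1
    st.burst_count = 0
    if st.hang_count > 0:
        st.hang_count -= 1
        return 1
    return 0
```

With these placeholder values and low noise, a burst of two active frames followed by silence yields two extra active frames, while a single isolated active frame earns no hangover.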
3.3.5.2 Background noise estimation

Background noise estimate (bckr_est[n]) is updated using amplitude levels of the previous frame. Thus, the
update is delayed by one frame to prevent undetected starts of speech bursts from corrupting the noise estimate. If
the internal VAD decision is "1" or if pitch has been detected, the noise estimate is not updated upwards. The
update speed for the current frame is selected as follows:

if ((vadreg for the last 4 frames has been zero) AND
    (pitch for the last 4 frames has been zero) AND
    (we are not in complex signal hangover))
    alpha_up = ALPHA_UP1
    alpha_down = ALPHA_DOWN1
else
    if ((stat_count = 0) AND (not in complex signal hangover))
        alpha_up = ALPHA_UP2
        alpha_down = ALPHA_DOWN2
    else
        alpha_up = 0
        alpha_down = ALPHA3

The variable stat_count indicates stationarity and its purpose is explained later in this subclause. The
variables alpha_up and alpha_down define the update speed upwards and downwards. The update speed
for each band n is selected as follows:

if (bckr_est_m[n] < level_{m-1}[n])
    alpha = alpha_up
else
    alpha = alpha_down

Finally, the noise estimate is updated as follows:

$$bckr\_est_{m+1}[n] = (1.0 - alpha) \cdot bckr\_est_m[n] + alpha \cdot level_{m-1}[n] \qquad (3.11)$$

where:

n   index of the frequency band
m   index of the frame

Level of the background estimate (bckr_est[n]) is limited between constants NOISE_MIN and NOISE_MAX.
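The per-band update of (3.11) with the final limiting step can be sketched as follows. The function name is hypothetical and the numeric values in the usage lines are placeholders; selecting alpha_up and alpha_down from the frame history is left to the caller.

```python
def update_noise_estimate(bckr_est, prev_level, alpha_up, alpha_down,
                          noise_min, noise_max):
    """Return the updated noise estimate for every band.

    bckr_est   : current estimates bckr_est_m[n]
    prev_level : levels of the previous frame, level_{m-1}[n]
    """
    updated = []
    for est, lev in zip(bckr_est, prev_level):
        # Upward and downward adaptation use different speeds.
        alpha = alpha_up if est < lev else alpha_down
        new_est = (1.0 - alpha) * est + alpha * lev
        # Limit the estimate to [noise_min, noise_max].
        updated.append(min(noise_max, max(noise_min, new_est)))
    return updated


# alpha_up = 0 freezes upward adaptation, as in the last branch of the
# update-speed selection above.
frozen_up = update_noise_estimate([100.0, 100.0], [200.0, 50.0],
                                  alpha_up=0.0, alpha_down=0.5,
                                  noise_min=0.0, noise_max=1000.0)
```

In this example the first band keeps its estimate because its level rose while upward updates were disabled, and the second band moves halfway toward the lower level.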

If the level of background noise increases suddenly, vadreg will be set to "1" and background noise is not
updated upwards. To recover from this situation, update of the background noise estimate is enabled if the
intermediate VAD decision (vadreg) is "1" for a long enough time and the spectrum is stationary. Stationarity
(stat_rat) is estimated using the following equation:

$$stat\_rat = \sum_{n=1}^{9} \frac{MAX(STAT\_THR\_LEVEL, MAX(ave\_level_m[n], level_m[n]))}{MAX(STAT\_THR\_LEVEL, MIN(ave\_level_m[n], level_m[n]))} \qquad (3.12)$$

If the stationarity estimate (stat_rat) is higher than a threshold, the stationarity counter (stat_count) is set to the
initial value defined by constant STAT_COUNT. The stationarity counter (stat_count) is also initialised if pitch
or tone or a complex_warning is detected. If the signal is not stationary but speech has been detected (VAD
decision is "1"), stat_count is decreased by one in each frame until it is zero.

if (complex_warning) {
    if (stat_count < CAD_MIN_STAT_COUNT)
        stat_count = CAD_MIN_STAT_COUNT
}

if ((8 last vadreg flags have been zero) OR (2 last pitch flags have been one) OR
    (5 last tone flags have been one))
    stat_count = STAT_COUNT
else
    if (stat_rat > STAT_THR)
        stat_count = STAT_COUNT
    else
        if ((vadreg) AND (stat_count != 0))
            stat_count = stat_count - 1



The average signal levels (ave_level[n]) are calculated as follows:

$$ave\_level_{m+1}[n] = (1.0 - alpha) \cdot ave\_level_m[n] + alpha \cdot level_m[n] \qquad (3.13)$$

The update speed (alpha) for the previous equation is selected as follows:

if (stat_count = STAT_COUNT)
    alpha = 1.0
else if (vadreg = 1)
    alpha = ALPHA5
else
    alpha = ALPHA4
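The stationarity measure, the counter update and the average level update of this subclause can be sketched together. Function and parameter names are illustrative, the history tests on vadreg, pitch and tone are passed in as one precomputed Boolean, and the numeric values used in any call are placeholders rather than the constants of the specification.

```python
def stat_rat(ave_level, level, stat_thr_level):
    """Sum over bands of the larger-to-smaller level ratio, floored at stat_thr_level."""
    return sum(
        max(stat_thr_level, max(a, l)) / max(stat_thr_level, min(a, l))
        for a, l in zip(ave_level, level)
    )


def update_stat_count(stat_count, rat, vadreg, history_reset, complex_warning,
                      stat_count_init, stat_thr, cad_min_stat_count):
    """Return the new stat_count following the pseudocode of this subclause.

    history_reset : True when the vadreg, pitch or tone history condition holds
    """
    if complex_warning and stat_count < cad_min_stat_count:
        stat_count = cad_min_stat_count
    if history_reset or rat > stat_thr:
        return stat_count_init
    if vadreg and stat_count != 0:
        stat_count -= 1
    return stat_count


def update_ave_level(ave_level, level, stat_count, vadreg,
                     stat_count_init, alpha4, alpha5):
    """Apply the average level update with the alpha selection rule above."""
    if stat_count == stat_count_init:
        alpha = 1.0
    elif vadreg == 1:
        alpha = alpha5
    else:
        alpha = alpha4
    return [(1.0 - alpha) * a + alpha * l for a, l in zip(ave_level, level)]
```

When the current levels equal the averages in all nine bands, every ratio is 1 and stat_rat takes its minimum value of 9. When stat_count has just been reset, alpha is 1.0 and the averages are simply replaced by the current levels.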
4 Technical Description of VAD Option 2

4.1 Definitions, symbols and abbreviations 

4.1.1 Definitions 

For the purposes of the present document, the following definitions apply: 

codec: The combination of an encoder and decoder in series (encoder/decoder). 

compand: The process of compressing and expanding a signal. In this text, the process is described in 
terms of PCM [4]. 

Decoder: Generally, a device for the translation of a signal from a digital representation into an analog 
format. For this standard, a device which converts speech encoded in the format specified in this standard to 
analog or an equivalent PCM representation. 

DFT: See Discrete Fourier Transform. 

Discrete Fourier Transform (DFT): A method of transforming a time domain sequence into a corresponding 
frequency domain sequence. 

Encoder: Generally, a device for the translation of a signal into a digital representation. For this standard, a 
device which converts speech from an analog or its equivalent PCM representation to the digital 
representation described in this standard. 

Fast Fourier Transform (FFT): An efficient implementation of the Discrete Fourier Transform. 

FFT: See Fast Fourier Transform. 

Vocoder: Voice coder. 

frame: Time interval of 20 ms corresponding to the time segmentation of the speech transcoder. 

4.1.2 Symbols 

For the purposes of this TS, the following symbols apply. 

4.1.2.1 Variables 

α_ch(m)           channel energy smoothing factor
α(m)              exponential windowing factor
Δ_E(m)            estimated spectral deviation between current power spectrum and average long term
                  power spectral estimate
φ(m)              spectral peak-to-average ratio
σ_q               quantized channel SNR indices
b(m)              burst count
b_th              burst count threshold
{d(m)}            overlapped portion of the frame buffer of input samples
E_ch(m,i)         channel energy estimate; channel i, subframe m
E_ch(m)           vector of channel energy estimates, 0 ≤ i < N_c
E_dB(m,i)         estimated log power spectrum
E_dB(m)           vector of log power spectrum estimates, 0 ≤ i < N_c
Ē_dB(m,i)         average long term power spectral estimate
Ē_dB(m)           vector of average long term power spectral estimates, 0 ≤ i < N_c
E_n(m,i)          channel noise estimate
E_n(m)            vector of channel noise estimates, 0 ≤ i < N_c
E_tn(m)           total estimated noise energy
E_tot(m)          total channel energy
E'_tot(m)         modified total channel energy
h(m)              hysteresis counter
h_cnt             hangover count
h_o(n)            overlap-and-add buffer of samples
hyster_cnt        hysteresis counter to avoid long term creeping of update_cnt
last_update_cnt   previous value of update_cnt
s_hp(n)           sample at the output of the speech encoder high pass filter
sinewave_flag     Boolean flag, set TRUE when spectral peak-to-average ratio is greater than 10 dB and the
                  spectral deviation is less than DEV_THLD
SNR               Signal to Noise ratio
SNR_p(m)          long-term peak SNR
SNR_q(m)          quantized version of SNR_p(m)
update_cnt        counter gating noise estimate update process
update_flag       flag controlling noise estimate updating
VAD(m)            Boolean VAD flag for subframe m
VAD_flag          Boolean VAD flag
v(m)              sum of voice metrics
v_th              voice metric threshold


4.1.2.2 Constants

αH upper limit for values of α(m)

αL lower limit for values of α(m)

αn channel noise smoothing factor

ζp pre-emphasis factor

btable table to generate bth

D overlap (delay) in sample intervals 

DEV_THLD threshold for setting sinewave_flag 

Efloor low threshold for Etot(m)

Eh high energy endpoint for linear interpolation of Etot(m)

Einit minimum allowable channel noise initialisation energy

El low energy endpoint for linear interpolation of Etot(m)

Emin minimum allowable channel energy

fH high channel combining table

fL low channel combining table

g(n) trapezoidal window, n = 0 to M

G(k) frequency domain transformation of g(n)

htable table to generate hcnt

HYSTER_CNT_THLD threshold for hyster_cnt 

L subframe length in samples 

M DFT sequence length 

Nc number of combined channels 

NOISE_FLOOR_D low threshold for Etot(m) in dB

UPDATE_CNT_THLD threshold for update_cnt
UPDATE_THLD threshold for v(m)

V voice metric table

vtable table to generate vth

4.1.2.3 Functions 

+ addition 

- subtraction
* multiplication

/ division

⌊x⌋ largest integer ≤ x

AND Boolean AND

OR Boolean OR

$$ \sum_{n=a}^{b} x(n) = x(a) + x(a+1) + \ldots + x(b-1) + x(b) $$

4.1.3 Abbreviations

ANSI American National Standards Institute
DTX Discontinuous Transmission
VAD Voice Activity Detector
CAD Complex Activity Detection
CNG Comfort Noise Generation



4.2 General



The function of the VAD algorithm is to indicate whether each 20 ms frame contains signals that should be 
transmitted, i.e. speech, music or information tones. The output of the VAD algorithm is a Boolean flag 
(VAD_flag) indicating presence of such signals. 



4.3 Functional description 



The block diagram of the VAD algorithm is depicted in figure 4.1. The VAD algorithm uses parameters of the
speech encoder to compute the Boolean VAD flag (VAD_flag).

Figure 4.1. Block Diagram of the VAD algorithm: Option 2

Input:

The output of the High-Pass Filter, {shp(n)}

LTP_flag, which is generated by the comparison of the long-term prediction gain to a constant
threshold LTP_THLD, where the long-term prediction gain β is derived from the speech
encoder [2] open-loop pitch predictor.

Output:

The output of the VAD is designated as VAD_flag

Initialization: 

The following variables shall be set to zero at initialization (frame m = 0): 

The pre-emphasis memory 

The following shall be initialized to a startup value other than zero: 
The channel energy estimate, Ech(m) (see Section 4.3.2)

The long-term power spectral estimate, ĒdB(m) (see Section 4.3.8)

The channel noise estimate, En(m) (see the Background Noise Estimate Update section)



Processing: The following procedures shall be executed two times per 20 ms speech frame and the current 
10 ms subframe shall be denoted m. 

4.3.1 Frequency Domain Conversion 

The input signal is pre-emphasised and windowed prior to frequency domain conversion. This process is
defined as

$$ d(n) = s_{hp}(n) + \zeta_p\, s_{hp}(n-1), \quad 0 \le n < L, \tag{4.1} $$

where d(n) is the pre-emphasised speech buffer, ζp is the pre-emphasis factor, and L is the subframe length.
A rectangular window is then used to frame the speech prior to frequency domain conversion, which is
expressed as:

$$ g(n) = \begin{cases} 0, & 0 \le n < D,\ L + D \le n < M \\ d(n-D), & D \le n < L + D \end{cases} \tag{4.2} $$

where D is the zero-padding offset into the DFT buffer, and M is the DFT length. The transformation of g(n)
to the frequency domain is performed using the Discrete Fourier Transform (DFT) defined¹ as:

$$ G(k) = \frac{2}{M} \sum_{n=0}^{M-1} g(n)\, e^{-j 2 \pi n k / M}, \quad 0 \le k < M \tag{4.3} $$

where $e^{j\omega}$ is a unit amplitude complex phasor with instantaneous radial position $\omega$.

¹ This atypical definition is used to exploit the efficiencies of the complex Fast Fourier Transform (FFT). The 2/M scale factor results
from preconditioning the M point real sequence to form an M/2 point complex sequence that is transformed using an M/2 point
complex FFT. Details on this technique can be found in Proakis, J. G. and Manolakis, D. G., Introduction to Digital Signal
Processing, New York, Macmillan, 1988, pp. 721-722.
4.3.2 Channel Energy Estimator

Calculate the channel energy estimate Ech(m) for the current subframe, m, as:

$$ E_{ch}(m,i) = \max\left\{ E_{min},\ \alpha_{ch}(m)\, E_{ch}(m-1,i) + \bigl(1 - \alpha_{ch}(m)\bigr) \frac{1}{f_H(i) - f_L(i) + 1} \sum_{k=f_L(i)}^{f_H(i)} |G(k)|^2 \right\}, \quad 0 \le i < N_c \tag{4.4} $$

where Emin is the minimum allowable channel energy, αch(m) is the channel energy smoothing factor (defined
below), Nc is the number of combined channels, and fL(i) and fH(i) are the i-th elements of the respective low
and high channel combining tables.

The channel energy smoothing factor, αch(m), is defined as:

$$ \alpha_{ch}(m) = \begin{cases} 0, & m \le 1 \\ 0.45, & m > 1 \end{cases} \tag{4.5} $$

That is, αch(m) takes a value of zero for the first frame (m = 1) and a value of 0.45 for all
subsequent frames. This allows the channel energy estimate to be initialized to the unfiltered channel
energy of the first frame.
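
NOTE: The channel energy recursion above can be sketched in C as follows. This is an informative illustration in double precision, not the ANSI C source code [1]. The names vad_channel_energy and vad_alpha_ch are hypothetical, and the combining tables and the minimum channel energy are passed in by the caller because their values are not listed in this subclause.

```c
#include <assert.h>
#include <stddef.h>

/* Channel energy update for one subframe. ech[] holds E_ch(m-1,i) on entry
 * and E_ch(m,i) on return; power[k] is |G(k)|^2. f_lo[] and f_hi[] are the
 * low and high channel combining tables (inclusive DFT bin ranges). */
void vad_channel_energy(double *ech, const double *power,
                        const int *f_lo, const int *f_hi, size_t nc,
                        double alpha_ch, double e_min)
{
    for (size_t i = 0; i < nc; i++) {
        assert(f_lo[i] >= 0 && f_hi[i] >= f_lo[i]);
        double sum = 0.0;
        for (int k = f_lo[i]; k <= f_hi[i]; k++)
            sum += power[k];
        double mean = sum / (double)(f_hi[i] - f_lo[i] + 1);
        double e = alpha_ch * ech[i] + (1.0 - alpha_ch) * mean;
        ech[i] = (e > e_min) ? e : e_min;   /* floor at the minimum energy */
    }
}

/* Smoothing factor: zero for the first frame, 0.45 afterwards. */
double vad_alpha_ch(long m)
{
    return (m <= 1) ? 0.0 : 0.45;
}
```

With a smoothing factor of zero the previous contents of ech[] are ignored, which is how the first frame initialises the estimate.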

4.3.3 Channel SNR Estimator 

Estimate the channel SNR vector {σ} as:

$$ \sigma(i) = 10 \log_{10}\left( \frac{E_{ch}(m,i)}{E_n(m,i)} \right), \quad 0 \le i < N_c \tag{4.6} $$

where En(m) is the current channel noise energy estimate (see the Background Noise Estimate Update section), and then
quantize the channel SNR estimate in 3/8 dB steps to yield the channel SNR indices {σq} given as:

$$ \sigma_q(i) = \max\{0,\ \min\{89,\ \mathrm{round}\{\sigma(i) / 0.375\}\}\}, \quad 0 \le i < N_c \tag{4.7} $$

where the values of {σq} are constrained to be between 0 and 89, inclusive.

4.3.4 Voice Metric Calculation 

Next, calculate the sum of voice metrics as:

$$ v(m) = \sum_{i=0}^{N_c-1} V(\sigma_q(i)), \tag{4.8} $$

where V(k) is the k-th value of the 90-element voice metric table V.
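
NOTE: The table lookup above amounts to the short loop sketched below. This is an informative illustration only. The name vad_voice_metric_sum is hypothetical, and the 90 table entries are supplied by the caller because the contents of V are not reproduced in this subclause.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of the voice metric table entries selected by the quantised channel
 * SNR indices. vtab must provide 90 entries (indices 0..89). */
long vad_voice_metric_sum(const int *sigma_q, size_t nc, const int *vtab)
{
    long v = 0;
    for (size_t i = 0; i < nc; i++) {
        assert(sigma_q[i] >= 0 && sigma_q[i] <= 89);
        v += vtab[sigma_q[i]];
    }
    return v;
}
```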

4.3.5 Frame SNR and Long-Term Peak SNR Calculation 

The instantaneous frame SNR, SNR, and long-term peak SNR, SNRp(m), are used to calibrate the
responsiveness of the VAD decision. When the frame count is less than or equal to four (m ≤ 4) or the
forced update flag (sec 4.3.10) is set (fupdate_flag == TRUE), then the SNRs are initialized as:

$$ SNR_p(m) = SNR = 56 - 10 \log_{10}\left( \sum_{i=0}^{N_c-1} E_n(m,i) \right) \tag{4.9} $$



Otherwise, the instantaneous frame SNR is generated by:

$$ SNR = 10 \log_{10}\left( \frac{1}{N_c} \sum_{i=0}^{N_c-1} \ldots \right) \tag{4.10} $$

and the long-term peak SNR is derived by the following expression:



$$ SNR_p(m) = \begin{cases} 0.9\, SNR_p(m-1) + 0.1\, SNR, & SNR > SNR_p(m-1) \\ 0.998\, SNR_p(m-1) + 0.002\, SNR, & 0.625\, SNR_p(m-1) < SNR \le SNR_p(m-1) \\ SNR_p(m-1), & \text{otherwise} \end{cases} \tag{4.11} $$



The long-term peak SNR is then quantized in 3 dB steps and limited to be between 0 and 19, as follows:

$$ SNR_q = \max\{ \min\{ \lfloor SNR_p(m) / 3 \rfloor,\ 19 \},\ 0 \} \tag{4.12} $$

where ⌊x⌋ is the largest integer ≤ x (floor function).

4.3.6 Negative SNR Sensitivity Bias 

In order for the VAD decision to overcome the problem of being over-sensitive to fluctuating, non-stationary
background noise conditions, a bias factor is used to increase the threshold on which the VAD decision is
based. This bias factor is derived from an estimate of the variability of the background noise estimate. The
variability estimate is further based on negative values of the instantaneous SNR. It is presumed that a
negative SNR can only occur as a result of fluctuating background noise, and not from the presence of voice.
Therefore, the bias factor μ(m) is derived by first calculating the variability factor ψ(m) as:

$$ \psi(m) = \begin{cases} 0.99\, \psi(m-1) + 0.01\, SNR^2, & SNR < 0 \\ \psi(m-1), & \text{otherwise} \end{cases} \tag{4.13} $$

which is then clamped in magnitude to 0 ≤ ψ(m) ≤ 4.0. In addition, the variability factor is reset to zero
when the frame count is less than or equal to four (m ≤ 4) or the forced update flag (sec 4.3.10) is set
(fupdate_flag == TRUE). The bias factor μ(m) is then calculated as:

$$ \mu(m) = \max\{ 12.0\,(\psi(m) - 0.65),\ 0 \} \tag{4.14} $$
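
NOTE: The variability and bias computations can be sketched as two small C functions. This is an informative illustration under the assumption of double precision arithmetic; the names vad_variability and vad_bias are hypothetical. The reset to zero during the first four frames or after a forced update is left to the caller.

```c
#include <assert.h>

/* Variability factor: smoothed squared SNR, updated only when the
 * instantaneous SNR is negative, then clamped to 0..4.0. */
double vad_variability(double psi_prev, double snr)
{
    assert(psi_prev >= 0.0 && psi_prev <= 4.0);
    double psi = psi_prev;
    if (snr < 0.0)
        psi = 0.99 * psi_prev + 0.01 * snr * snr;
    if (psi < 0.0)
        psi = 0.0;
    if (psi > 4.0)
        psi = 4.0;
    return psi;
}

/* Bias added to the voice metric threshold; zero until psi exceeds 0.65. */
double vad_bias(double psi)
{
    double mu = 12.0 * (psi - 0.65);
    return (mu > 0.0) ? mu : 0.0;
}
```

Because of the clamp at 4.0, the bias in this sketch never exceeds 12.0 * (4.0 - 0.65) = 40.2.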



4.3.7 VAD Decision 

The quantized SNR SNRq is used to determine the respective voice metric threshold vth, hangover count hcnt,
and burst count threshold bth parameters:

$$ v_{th} = v_{table}[SNR_q], \quad h_{cnt} = h_{table}[SNR_q], \quad b_{th} = b_{table}[SNR_q] \tag{4.15} $$

where SNRq is the index of the respective table elements. The VAD decision can then be made according
to the following pseudocode:

if ( v(m) > vth + μ(m) ) {         /* if the voice metric > voice metric threshold */
    VAD(m) = ON
    b(m) = b(m-1) + 1              /* increment burst counter */
    if ( b(m) > bth ) {            /* compare counter with threshold */
        h(m) = hcnt                /* set hangover */
    }
} else {
    b(m) = 0                       /* clear burst counter */
    h(m) = h(m-1) - 1              /* decrement hangover */
    if ( h(m) <= 0 ) {             /* check for expired hangover */
        VAD(m) = OFF
        h(m) = 0
    } else {
        VAD(m) = ON                /* hangover not yet expired */
    }
}

Note that two 10 ms subframes are required to determine one VAD decision. The final decision is
determined by the maximum of two subframe decisions, i.e.

if ( VAD(m) == ON OR VAD(m-1) == ON ) {
    VAD_flag = TRUE
} else {
    VAD_flag = FALSE
}
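
NOTE: The decision pseudocode of this subclause translates to C roughly as sketched below. The fragment is informative and is not the ANSI C source code [1]. The type and function names are hypothetical. One assumption is made explicit: when the voice metric exceeds the threshold but the burst count does not exceed bth, the pseudocode leaves h(m) unassigned, and the sketch carries h(m-1) forward unchanged.

```c
#include <assert.h>
#include <stdbool.h>

/* State carried between subframes: burst counter b and hangover counter h. */
typedef struct {
    int burst;
    int hangover;
} vad_decision_state;

/* One subframe of the decision logic. Returns VAD(m). */
bool vad_subframe_decision(vad_decision_state *st, double v, double v_th,
                           double mu, int h_cnt, int b_th)
{
    assert(h_cnt >= 0 && b_th >= 0);
    if (v > v_th + mu) {
        st->burst += 1;
        if (st->burst > b_th)
            st->hangover = h_cnt;   /* arm the hangover */
        return true;
    }
    st->burst = 0;
    st->hangover -= 1;
    if (st->hangover <= 0) {
        st->hangover = 0;           /* hangover expired */
        return false;
    }
    return true;                    /* hangover not yet expired */
}

/* Frame decision: logical OR of the two subframe decisions. */
bool vad_frame_flag(bool vad_prev, bool vad_cur)
{
    return vad_prev || vad_cur;
}
```

With h_cnt = 2 and b_th = 1, for example, two consecutive active subframes arm the hangover, after which one inactive subframe is still reported as active before the decision drops to OFF.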

4.3.8 Spectral Deviation Estimator 

The spectral deviation estimator is used as a safeguard against erroneous updates of the background noise 
estimate. If the spectral deviation of the input signal is too high, then the background noise estimate update 
may not be permitted. Calculate the estimated log power spectrum as: 

$$ E_{dB}(m,i) = 10 \log_{10}\bigl(E_{ch}(m,i)\bigr), \quad 0 \le i < N_c \tag{4.16} $$

Then, calculate the estimated spectral deviation between the current power spectrum and the average
long-term power spectral estimate:

$$ \Delta E(m) = \sum_{i=0}^{N_c-1} \bigl| E_{dB}(m,i) - \bar{E}_{dB}(m,i) \bigr| \tag{4.17} $$

where ĒdB(m) is the average long-term power spectral estimate calculated during the previous subframe,
as defined in Equation 4.20. The initial value of ĒdB(m), however, is defined to be the estimated log power
spectrum of subframe 1, or:

$$ \bar{E}_{dB}(m) = E_{dB}(m), \quad m = 1 \tag{4.18} $$



The exponential windowing factor, α(m), is then calculated as a function of the instantaneous frame SNR
SNR and the long-term peak SNR SNRp(m), as:

$$ \alpha(m) = \alpha_H - (\alpha_H - \alpha_L)\, \frac{SNR_p(m) - SNR}{SNR_p(m)} \tag{4.19} $$

which is then limited to αL ≤ α(m) ≤ αH.

The average long-term power spectral estimate is then updated for the next frame by:

$$ \bar{E}_{dB}(m+1,i) = \alpha(m)\, \bar{E}_{dB}(m,i) + \bigl(1 - \alpha(m)\bigr) E_{dB}(m,i), \quad 0 \le i < N_c \tag{4.20} $$

where all the variables are previously defined.

4.3.9 Sinewave Detection 

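Next the sinewave_flag is set TRUE when the spectral peak-to-average ratio φ(m) is greater than 10 dB, i.e.

$$ sinewave\_flag = \begin{cases} TRUE, & \phi(m) > 10 \\ FALSE, & \text{otherwise} \end{cases} \tag{4.21} $$

where:

$$ \phi(m) = 10 \log_{10}\left( \frac{\max\{E_{ch}(m,i)\}}{\sum_{j=0}^{N_c-1} E_{ch}(m,j) / N_c} \right), \quad 2 \le i < N_c \tag{4.22} $$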



4.3.10 Background Noise Update Decision 

The following logic, as shown in pseudo-code, demonstrates how the noise estimate update decision is 
ultimately made: 

/* Normal update logic */
update_flag = fupdate_flag = FALSE
if ( v(m) < UPDATE_THLD and b(m) == 0 ) {
    update_flag = TRUE
    update_cnt = 0
}
/* Forced update logic (for over-riding the normal update logic) */
else if ( ( Etot > NOISE_FLOOR ) and ( ΔE(m) < DEV_THLD )
          and ( sinewave_flag == FALSE ) and ( LTP_flag == FALSE ) ) {
    update_cnt = update_cnt + 1
    if ( update_cnt > UPDATE_CNT_THLD )
        update_flag = fupdate_flag = TRUE
}

/* "Hysteresis" logic to prevent long-term creeping of update_cnt */
if ( update_cnt == last_update_cnt )
    hyster_cnt = hyster_cnt + 1
else
    hyster_cnt = 0

last_update_cnt = update_cnt

if ( hyster_cnt > HYSTER_CNT_THLD )
    update_cnt = 0



where Etot is the total channel energy defined as:

$$ E_{tot} = \sum_{i=0}^{N_c-1} E_{ch}(m,i) \tag{4.23} $$

and LTP_flag is generated by the comparison of the long-term prediction gain to a constant threshold
LTP_THLD, i.e.:

$$ LTP\_flag = \begin{cases} TRUE, & \beta > LTP\_THLD \\ FALSE, & \text{otherwise} \end{cases} \tag{4.24} $$

where the long-term prediction gain β is derived from the speech encoder [2] open-loop pitch predictor, and
can be expressed as:

where sw(n) is the weighted speech, k is the optimal open-loop lag, and Np is the pitch analysis frame length.
This expression is calculated in the speech encoder on the previous frame.
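
NOTE: The update decision logic of this subclause can be sketched in C as shown below. The fragment is informative and is not the ANSI C source code [1]. All type and function names are hypothetical, and the thresholds are collected in a caller-supplied structure because their values are not listed in this subclause. LTP_flag and sinewave_flag are taken as inputs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {            /* thresholds, supplied by the caller */
    double update_thld;
    double noise_floor;
    double dev_thld;
    int update_cnt_thld;
    int hyster_cnt_thld;
} vad_update_thresholds;

typedef struct {            /* counters carried between subframes */
    int update_cnt;
    int last_update_cnt;
    int hyster_cnt;
} vad_update_state;

typedef struct {
    bool update_flag;
    bool fupdate_flag;
} vad_update_flags;

vad_update_flags vad_update_decision(vad_update_state *st,
                                     const vad_update_thresholds *th,
                                     double v, int burst, double e_tot,
                                     double dev, bool sinewave_flag,
                                     bool ltp_flag)
{
    vad_update_flags f = { false, false };
    assert(st != NULL && th != NULL);

    if (v < th->update_thld && burst == 0) {           /* normal update */
        f.update_flag = true;
        st->update_cnt = 0;
    } else if (e_tot > th->noise_floor && dev < th->dev_thld &&
               !sinewave_flag && !ltp_flag) {           /* forced update */
        st->update_cnt += 1;
        if (st->update_cnt > th->update_cnt_thld)
            f.update_flag = f.fupdate_flag = true;
    }

    /* Hysteresis: clear update_cnt once it has stalled for too long. */
    if (st->update_cnt == st->last_update_cnt)
        st->hyster_cnt += 1;
    else
        st->hyster_cnt = 0;
    st->last_update_cnt = st->update_cnt;
    if (st->hyster_cnt > th->hyster_cnt_thld)
        st->update_cnt = 0;

    return f;
}
```

Returning both flags lets the caller reuse fupdate_flag for the resets described in the frame SNR and sensitivity bias subclauses.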

4.3.11 Background Noise Estimate Update

If (and only if) the update flag is set (update_flag == TRUE), then update the channel noise estimate for the
next subframe by:

$$ E_n(m+1,i) = \max\{ E_{min},\ \alpha_n E_n(m,i) + (1 - \alpha_n) E_{ch}(m,i) \}, \quad 0 \le i < N_c \tag{4.26} $$

where Emin is the minimum allowable channel energy, and αn is the channel noise smoothing factor. The
channel noise estimate shall be initialized for each of the first four frames to the estimated channel energy,
i.e.:

$$ E_n(m,i) = \max\{ E_{init},\ E_{ch}(m,i) \}, \quad m \le 4, \quad 0 \le i < N_c \tag{4.27} $$

where Einit is the minimum allowable channel noise initialization energy.
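
NOTE: The noise estimate update can be sketched in C as follows. The fragment is informative and is not the ANSI C source code [1]. The name vad_update_noise is hypothetical, and the smoothing factor and energy limits are passed as parameters because their values are not given in this subclause. Giving the initialisation rule precedence over the smoothing rule during the first four frames is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Channel noise estimate update for subframe m. en[] is updated in place
 * from the channel energies ech[] of the same subframe. */
void vad_update_noise(double *en, const double *ech, size_t nc, long m,
                      bool update_flag, double alpha_n,
                      double e_min, double e_init)
{
    assert(alpha_n >= 0.0 && alpha_n <= 1.0);
    for (size_t i = 0; i < nc; i++) {
        if (m <= 4) {
            /* initialisation during the first four frames (assumed to
             * take precedence over the smoothing rule) */
            en[i] = (ech[i] > e_init) ? ech[i] : e_init;
        } else if (update_flag) {
            /* first-order smoothing with a floor at the minimum energy */
            double e = alpha_n * en[i] + (1.0 - alpha_n) * ech[i];
            en[i] = (e > e_min) ? e : e_min;
        }
    }
}
```

When update_flag is FALSE after the initialisation period, the estimate is left untouched.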



5 Computational details 

A low level description has been prepared in the form of ANSI C source code [1].